In this notebook, we'll use sklearn to perform hierarchical clustering on the Iris dataset. The dataset has 4 dimensions/attributes and 150 samples, and each sample is labeled as one of three species of Iris flower.
In this exercise, we'll ignore the labels, cluster based on the attributes, and then compare the results of the different hierarchical clustering techniques with the actual labels to see which technique matches them best in this scenario. Afterwards, we'll visualize the resulting cluster hierarchy.
In [1]:
from sklearn import datasets
iris = datasets.load_iris()
Let's look at the first 10 samples in the dataset
In [2]:
iris.data[:10]
Out[2]:
In [3]:
iris.target
Out[3]:
Now let's perform hierarchical clustering using sklearn's AgglomerativeClustering
In [4]:
from sklearn.cluster import AgglomerativeClustering
# Hierarchical clustering
# Ward is the default linkage algorithm, so we'll start with that
ward = AgglomerativeClustering(n_clusters=3)
ward_pred = ward.fit_predict(iris.data)
Let's also try complete and average linkage.

Exercise:
Perform clustering with complete linkage and store the predicted labels in complete_pred.
Perform clustering with average linkage and store the predicted labels in avg_pred.

Note: look at the AgglomerativeClustering documentation to find the appropriate value to pass as the linkage value
In [ ]:
# Hierarchical clustering using complete linkage
# TODO: Create an instance of AgglomerativeClustering with the appropriate parameters
complete = AgglomerativeClustering(n_clusters=3, linkage="complete")
# Fit & predict
# TODO: Make AgglomerativeClustering fit the dataset and predict the cluster labels
complete_pred = complete.fit_predict(iris.data)
# Hierarchical clustering using average linkage
# TODO: Create an instance of AgglomerativeClustering with the appropriate parameters
avg = AgglomerativeClustering(n_clusters=3, linkage="average")
# Fit & predict
# TODO: Make AgglomerativeClustering fit the dataset and predict the cluster labels
avg_pred = avg.fit_predict(iris.data)
To determine which clustering result better matches the original labels of the samples, we can use adjusted_rand_score
, an external cluster validation index whose score is between -1 and 1, where 1 means the two clusterings group the samples in the dataset in exactly the same way (regardless of the label assigned to each cluster).
Cluster validation indices are covered later in the course.
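As a quick illustration of what adjusted_rand_score measures, here is a minimal sketch (with made-up toy labels) showing that it only compares the grouping of samples, not the actual label values:

```python
from sklearn.metrics import adjusted_rand_score

labels_true = [0, 0, 1, 1, 2, 2]
# The same partition with the cluster labels permuted:
labels_same_grouping = [2, 2, 0, 0, 1, 1]
# A genuinely different partition:
labels_different = [0, 1, 0, 1, 0, 1]

print(adjusted_rand_score(labels_true, labels_same_grouping))  # 1.0
print(adjusted_rand_score(labels_true, labels_different))
```

The first call returns 1.0 because the two clusterings split the samples identically, even though every label differs.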
In [ ]:
from sklearn.metrics import adjusted_rand_score
ward_ar_score = adjusted_rand_score(iris.target, ward_pred)
Exercise:
In [ ]:
# TODO: Calculate the adjusted Rand score for the complete linkage clustering labels
complete_ar_score = adjusted_rand_score(iris.target, complete_pred)
# TODO: Calculate the adjusted Rand score for the average linkage clustering labels
avg_ar_score = adjusted_rand_score(iris.target, avg_pred)
Which algorithm results in the higher adjusted Rand score?
In [ ]:
print( "Scores: \nWard:", ward_ar_score,"\nComplete: ", complete_ar_score, "\nAverage: ", avg_ar_score)
In [ ]:
iris.data[:15]
Looking at the dataset, we can see that the fourth column has smaller values than the rest of the columns, so its variance counts for less in the clustering process (which is based on distance). Let's scale the dataset so that each dimension lies between 0 and 1 and the dimensions carry equal weight in the clustering process.
One way to do this is to subtract each column's minimum and then divide by its range (min-max scaling).
Here we'll use sklearn's preprocessing.normalize()
utility, which instead scales each sample (row) to unit length; since all the measurements are positive, the resulting values also fall between 0 and 1
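The "subtract the minimum, divide by the range" recipe can be sketched directly with numpy, and sklearn also ships it as minmax_scale (shown here on a tiny made-up array, as an alternative to the normalize() call used below):

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# Subtract each column's minimum and divide by its range...
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
# ...which is exactly what minmax_scale computes
scaled = minmax_scale(X)
print(np.allclose(manual, scaled))  # True
```

Either approach puts every column on the same 0-to-1 footing before the distance computations.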
In [ ]:
from sklearn import preprocessing
normalized_X = preprocessing.normalize(iris.data)
normalized_X[:10]
All the columns are now in the range 0 to 1. Does clustering the dataset after this transformation produce better clusterings (ones that better match the original labels of the samples)?
In [ ]:
ward = AgglomerativeClustering(n_clusters=3)
ward_pred = ward.fit_predict(normalized_X)
complete = AgglomerativeClustering(n_clusters=3, linkage="complete")
complete_pred = complete.fit_predict(normalized_X)
avg = AgglomerativeClustering(n_clusters=3, linkage="average")
avg_pred = avg.fit_predict(normalized_X)
ward_ar_score = adjusted_rand_score(iris.target, ward_pred)
complete_ar_score = adjusted_rand_score(iris.target, complete_pred)
avg_ar_score = adjusted_rand_score(iris.target, avg_pred)
print( "Scores: \nWard:", ward_ar_score,"\nComplete: ", complete_ar_score, "\nAverage: ", avg_ar_score)
In [ ]:
# Import scipy's linkage function to conduct the clustering
from scipy.cluster.hierarchy import linkage
# Specify the linkage type. Scipy accepts 'ward', 'complete', 'average', as well as other values
# Pick the one that resulted in the highest Adjusted Rand Score
linkage_type = 'ward'
linkage_matrix = linkage(normalized_X, linkage_type)
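To see what the linkage matrix actually contains, here is a small sketch on random data (the array shape and seed are made up for illustration): each row of the matrix records one merge, and scipy's fcluster can cut the resulting tree into flat clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.random((10, 4))
Z = linkage(X, 'ward')

# Each row of Z is one merge: [cluster_i, cluster_j, distance, new_cluster_size]
print(Z.shape)  # (9, 4): n - 1 merges for n samples

# Cut the tree into at most 3 flat clusters
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)
```

Cutting the dendrogram this way is how a concrete cluster assignment is recovered from the hierarchy that dendrogram() visualizes below.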
Let's plot it using scipy's dendrogram function
In [ ]:
from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt
plt.figure(figsize=(22,18))
# plot using 'dendrogram()'
dendrogram(linkage_matrix)
plt.show()
In [ ]:
import seaborn as sns
sns.clustermap(normalized_X, figsize=(12,18), method=linkage_type, cmap='viridis')
# Expand figsize to a value like (18, 50) if you want the sample labels to be readable
# Drawback is that you'll need more scrolling to observe the dendrogram
plt.show()
Looking at the colors of the dimensions, can you spot the differences between the three Iris species? You should at least be able to notice how one species is completely different from the other two (in the top third of the figure).